⭐ 1. Introduction & Overview¶
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
🔹 2. Import Libraries & Set Up¶
In [255]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Machine Learning
import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score, root_mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
# Feature Importance & Explainability
import shap
# Settings
import warnings
warnings.filterwarnings("ignore")
# Set random seed for reproducibility
SEED = 42
np.random.seed(SEED)
print("Libraries loaded. Ready to go!")
Libraries loaded. Ready to go!
🔹 3. Load & Explore Data¶
In [256]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [257]:
train.head()
Out[257]:
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
In [258]:
train.shape
Out[258]:
(1460, 81)
In [259]:
train.isnull().sum()
Out[259]:
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 81, dtype: int64
In [260]:
# Quick summary of dataset
train.describe()
train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 588 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
🔹 4. Data Visualization & EDA¶
In [261]:
float_cols = [col for col in train.columns if train[col].dtype == "float64"]
cols_per_row = 3
num_plots = len(float_cols)
rows = (num_plots // cols_per_row) + (num_plots % cols_per_row > 0)
fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, 5 * rows))
axes = axes.flatten()
for idx, col in enumerate(float_cols):
sns.histplot(train[col], bins=50, kde=True, ax=axes[idx])
axes[idx].set_title(f"Distribution of {col}")
for i in range(idx + 1, len(axes)):
fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
In [262]:
categorical_features = train.select_dtypes(include=['object']).columns
num_features = len(categorical_features)
cols = 3
rows = (num_features // cols) + (num_features % cols > 0)
# Create subplots
fig, axes = plt.subplots(rows, cols, figsize=(15, rows * 5))
axes = axes.flatten()
for i, feature in enumerate(categorical_features):
train[feature].value_counts().plot.pie(
autopct='%1.1f%%', ax=axes[i], startangle=90, cmap="viridis"
)
axes[i].set_title(feature)
axes[i].set_ylabel("")
# Hide any unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
In [263]:
heatmap_train = pd.DataFrame()
for col in train.columns:
if train[col].dtype == "float64" or train[col].dtype == "int64":
heatmap_train[col] = train[col]
plt.figure(figsize=(30,12))
sns.heatmap(heatmap_train.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
In [264]:
heatmap_train = train.select_dtypes(include=["float64", "int64"])
corr_matrix = heatmap_train.corr()
threshold = 0.75
high_corr_pairs = (
corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
.stack()
.reset_index()
)
high_corr_pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
high_corr_pairs = high_corr_pairs[high_corr_pairs["Correlation"].abs() > threshold]
plt.figure(figsize=(30, 12))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
print("Highly correlated feature pairs (above threshold):")
print(high_corr_pairs)
Highly correlated feature pairs (above threshold):
Feature 1 Feature 2 Correlation
174 OverallQual SalePrice 0.790982
225 YearBuilt GarageYrBlt 0.825667
378 TotalBsmtSF 1stFlrSF 0.819530
478 GrLivArea TotRmsAbvGrd 0.825489
637 GarageCars GarageArea 0.882475
In [265]:
#interesting_features = ["OverallQual", "YearBuilt", "GarageYrBlt", "TotalBsmtSF", "1stFlrSF", "GrLivArea", "TotRmsAbvGrd", "GarageCars", "GarageArea"]
l1 = high_corr_pairs['Feature 1'].tolist()
l2 = high_corr_pairs['Feature 2'].tolist()
interesting_features = list(set(l1+l2))
interesting_features.remove('SalePrice')
print(interesting_features)
['GarageYrBlt', 'TotalBsmtSF', 'OverallQual', '1stFlrSF', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'GrLivArea', 'YearBuilt']
🔹 5. Feature Engineering¶
In [266]:
train.columns = train.columns.str.strip()
test.columns = test.columns.str.strip()
In [267]:
print(f"Train set, null count: \n{train.isnull().sum()}")
print("\n")
print(f"Test set, null count: \n{test.isnull().sum()}")
Train set, null count:
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 81, dtype: int64
Test set, null count:
Id 0
MSSubClass 0
MSZoning 4
LotFrontage 227
LotArea 0
...
MiscVal 0
MoSold 0
YrSold 0
SaleType 1
SaleCondition 0
Length: 80, dtype: int64
In [268]:
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
test["LotFrontage"] = test.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
train[col] = train[col].fillna('None')
test[col] = test[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
train[col] = train[col].fillna(0)
test[col] = test[col].fillna(0)
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
test['TotalSF'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']
In [269]:
for col in train.columns:
if train[col].dtype == "object":
train[col] = train[col].fillna("None")
elif train[col].dtype in ["float64", "int64"]:
train[col] = train[col].fillna(train[col].mean())
for col in test.columns:
if test[col].dtype == "object":
test[col] = test[col].fillna("None")
elif test[col].dtype in ["float64", "int64"]:
test[col] = test[col].fillna(test[col].mean())
In [270]:
for col in train.columns:
if train[col].isnull().sum() > 0:
print(col)
for col in test.columns:
if test[col].isnull().sum() > 0:
print(col)
No more empty items left. Great!
In [271]:
import itertools
def create_combination_features(df, features):
combinations = itertools.combinations(features, 2)
for comb in combinations:
feature_name = "_".join(comb)
df[feature_name] = df[list(comb)].mean(axis=1)
return df
train = create_combination_features(train, interesting_features)
test = create_combination_features(test, interesting_features)
In [272]:
train.head()
Out[272]:
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | TotRmsAbvGrd_GarageCars | TotRmsAbvGrd_GarageArea | TotRmsAbvGrd_GrLivArea | TotRmsAbvGrd_YearBuilt | GarageCars_GarageArea | GarageCars_GrLivArea | GarageCars_YearBuilt | GarageArea_GrLivArea | GarageArea_YearBuilt | GrLivArea_YearBuilt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | None | Reg | Lvl | AllPub | ... | 5.0 | 278.0 | 859.0 | 1005.5 | 275.0 | 856.0 | 1002.5 | 1129.0 | 1275.5 | 1856.5 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | None | Reg | Lvl | AllPub | ... | 4.0 | 233.0 | 634.0 | 991.0 | 231.0 | 632.0 | 989.0 | 861.0 | 1218.0 | 1619.0 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | None | IR1 | Lvl | AllPub | ... | 4.0 | 307.0 | 896.0 | 1003.5 | 305.0 | 894.0 | 1001.5 | 1197.0 | 1304.5 | 1893.5 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | None | IR1 | Lvl | AllPub | ... | 5.0 | 324.5 | 862.0 | 961.0 | 322.5 | 860.0 | 959.0 | 1179.5 | 1278.5 | 1816.0 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | None | IR1 | Lvl | AllPub | ... | 6.0 | 422.5 | 1103.5 | 1004.5 | 419.5 | 1100.5 | 1001.5 | 1517.0 | 1418.0 | 2099.0 |
5 rows × 118 columns
In [273]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in train.columns:
if train[col].dtype == "object":
train[col] = le.fit_transform(train[col])
for col in test.columns:
if test[col].dtype == "object":
test[col] = le.fit_transform(test[col])
🔹 6. Model Selection¶
In [274]:
X = train.drop(columns=["Id", "SalePrice"])
X_test = test.drop(columns=["Id"])
y = train['SalePrice']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=SEED)
In [275]:
param_grid = {
#'n_estimators': [100, 200, 500],
#'learning_rate': [0.01, 0.05, 0.1],
#'max_depth': [3, 5, 7, 9],
#'subsample': [0.8, 0.9, 1.0],
#'colsample_bytree': [0.8, 0.9, 1.0],
'alpha': [0, 0.01, 0.1, 1],
'lambda': [0, 0.1, 0.5, 1],
'gamma': [0, 0.1, 0.2, 1],
'early_stopping_rounds': [5, 10, 20, 30]
}
grid_search = GridSearchCV(xg.XGBRegressor(tree_method="gpu_hist", random_state=SEED), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_val, y_val)])
print("Best Parameters:", grid_search.best_params_)
best_params = grid_search.best_params_
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) Cell In[275], line 14 1 param_grid = { 2 #'n_estimators': [100, 200, 500], 3 #'learning_rate': [0.01, 0.05, 0.1], (...) 10 'early_stopping_rounds': [5, 10, 20, 30] 11 } 13 grid_search = GridSearchCV(xg.XGBRegressor(tree_method="gpu_hist", random_state=SEED), param_grid, cv=5, n_jobs=-1) ---> 14 grid_search.fit(X_train, y_train, 15 eval_set=[(X_train, y_train), (X_val, y_val)]) 17 print("Best Parameters:", grid_search.best_params_) 19 best_params = grid_search.best_params_ File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs) 1466 estimator._validate_params() 1468 with config_context( 1469 skip_parameter_validation=( 1470 prefer_skip_nested_validation or global_skip_validation 1471 ) 1472 ): -> 1473 return fit_method(estimator, *args, **kwargs) File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:1018, in BaseSearchCV.fit(self, X, y, **params) 1012 results = self._format_results( 1013 all_candidate_params, n_splits, all_out, all_more_results 1014 ) 1016 return results -> 1018 self._run_search(evaluate_candidates) 1020 # multimetric is determined here because in the case of a callable 1021 # self.scoring the return type is only known after calling 1022 first_test_score = all_out[0]["test_scores"] File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:1572, in GridSearchCV._run_search(self, evaluate_candidates) 1570 def _run_search(self, evaluate_candidates): 1571 """Search all candidates in param_grid""" -> 1572 evaluate_candidates(ParameterGrid(self.param_grid)) File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:964, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results) 956 if self.verbose > 0: 957 print( 958 "Fitting {0} folds for each of {1} candidates," 959 " totalling {2} fits".format( 960 n_splits, n_candidates, n_candidates * n_splits 961 ) 962 ) --> 964 out = parallel( 965 delayed(_fit_and_score)( 966 clone(base_estimator), 967 X, 968 y, 969 train=train, 970 test=test, 971 parameters=parameters, 972 split_progress=(split_idx, n_splits), 973 candidate_progress=(cand_idx, n_candidates), 974 **fit_and_score_kwargs, 975 ) 976 for (cand_idx, parameters), (split_idx, (train, test)) in product( 977 enumerate(candidate_params), 978 enumerate(cv.split(X, y, **routed_params.splitter.split)), 979 ) 980 ) 982 if len(out) < 1: 983 raise ValueError( 984 "No fits were performed. " 985 "Was the CV iterator empty? " 986 "Were there no candidates?" 987 ) File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\parallel.py:74, in Parallel.__call__(self, iterable) 69 config = get_config() 70 iterable_with_config = ( 71 (_with_config(delayed_func, config), args, kwargs) 72 for delayed_func, args, kwargs in iterable 73 ) ---> 74 return super().__call__(iterable_with_config) File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py:2007, in Parallel.__call__(self, iterable) 2001 # The first item from the output is blank, but it makes the interpreter 2002 # progress until it enters the Try/Except block of the generator and 2003 # reaches the first `yield` statement. This starts the asynchronous 2004 # dispatch of the tasks to the workers. 2005 next(output) -> 2007 return output if self.return_generator else list(output) File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py:1650, in Parallel._get_outputs(self, iterator, pre_dispatch) 1647 yield 1649 with self._backend.retrieval_context(): -> 1650 yield from self._retrieve() 1652 except GeneratorExit: 1653 # The generator has been garbage collected before being fully 1654 # consumed. This aborts the remaining tasks if possible and warn 1655 # the user if necessary. 1656 self._exception = True File c:\Users\robkr\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\parallel.py:1762, in Parallel._retrieve(self) 1757 # If the next job is not ready for retrieval yet, we just wait for 1758 # async callbacks to progress. 1759 if ((len(self._jobs) == 0) or 1760 (self._jobs[0].get_status( 1761 timeout=self.timeout) == TASK_PENDING)): -> 1762 time.sleep(0.01) 1763 continue 1765 # We need to be careful: the job list can be filling up as 1766 # we empty it and Python list are not thread-safe by 1767 # default hence the use of the lock KeyboardInterrupt:
In [282]:
model = xg.XGBRegressor(
n_estimators=200,
learning_rate=0.1,
max_depth=6,
early_stopping_rounds=30,
random_state=SEED)
model.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_val, y_val)])
results = model.evals_result()
plt.figure(figsize=(10,7))
plt.plot(results["validation_0"]["rmse"], label="Training loss")
plt.plot(results["validation_1"]["rmse"], label="Validation loss")
plt.axvline(21, color="gray", label="Optimal tree number")
plt.xlabel("Number of trees")
plt.ylabel("Loss")
plt.legend()
predictions = model.predict(X_val)
mse = mean_squared_error(y_val, predictions)
mae = mean_absolute_error(y_val, predictions)
r2 = r2_score(y_val, predictions)
rms = root_mean_squared_error(y_val, predictions)
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R² Score: {r2}")
print(f"RMSE Score: {rms}")
[0] validation_0-rmse:70970.99675 validation_1-rmse:81185.69429
[1] validation_0-rmse:65389.63001 validation_1-rmse:75536.54372 [2] validation_0-rmse:60325.50352 validation_1-rmse:70391.48674 [3] validation_0-rmse:55779.77828 validation_1-rmse:65897.70937 [4] validation_0-rmse:51688.05832 validation_1-rmse:61486.43675 [5] validation_0-rmse:48065.76245 validation_1-rmse:57493.91657 [6] validation_0-rmse:44762.08765 validation_1-rmse:54121.95396 [7] validation_0-rmse:41787.54140 validation_1-rmse:51068.12859 [8] validation_0-rmse:39174.55442 validation_1-rmse:48461.50333 [9] validation_0-rmse:36846.50650 validation_1-rmse:46160.89333 [10] validation_0-rmse:34761.88632 validation_1-rmse:44186.04997 [11] validation_0-rmse:32901.03478 validation_1-rmse:42473.86017 [12] validation_0-rmse:31105.18465 validation_1-rmse:40959.48824 [13] validation_0-rmse:29542.99758 validation_1-rmse:39595.32568 [14] validation_0-rmse:28124.51338 validation_1-rmse:38338.29472 [15] validation_0-rmse:26930.49823 validation_1-rmse:37394.57382 [16] validation_0-rmse:25759.82301 validation_1-rmse:36433.66961 [17] validation_0-rmse:24692.30199 validation_1-rmse:35612.50055 [18] validation_0-rmse:23724.48250 validation_1-rmse:34889.01702 [19] validation_0-rmse:22904.96392 validation_1-rmse:34120.27285 [20] validation_0-rmse:22146.97539 validation_1-rmse:33488.32512 [21] validation_0-rmse:21451.91814 validation_1-rmse:32986.14746 [22] validation_0-rmse:20844.12469 validation_1-rmse:32514.35349 [23] validation_0-rmse:20278.17371 validation_1-rmse:32265.58492 [24] validation_0-rmse:19769.23424 validation_1-rmse:32065.88325 [25] validation_0-rmse:19330.95042 validation_1-rmse:31783.42352 [26] validation_0-rmse:18938.96910 validation_1-rmse:31452.40945 [27] validation_0-rmse:18469.84238 validation_1-rmse:31193.25027 [28] validation_0-rmse:18056.81178 validation_1-rmse:30963.57389 [29] validation_0-rmse:17699.04077 validation_1-rmse:30800.24726 [30] validation_0-rmse:17380.23489 validation_1-rmse:30656.27114 [31] validation_0-rmse:17042.93197 validation_1-rmse:30421.24088 [32] validation_0-rmse:16728.89263 validation_1-rmse:30250.88544 [33] validation_0-rmse:16402.86279 validation_1-rmse:30290.47945 [34] validation_0-rmse:16063.11172 validation_1-rmse:30207.54612 [35] validation_0-rmse:15800.82453 validation_1-rmse:30202.17239 [36] validation_0-rmse:15593.26356 validation_1-rmse:30148.52294 [37] validation_0-rmse:15362.04726 validation_1-rmse:30144.52077 [38] validation_0-rmse:15209.44196 validation_1-rmse:30077.06972 [39] validation_0-rmse:15071.16011 validation_1-rmse:30031.71573 [40] validation_0-rmse:14855.41890 validation_1-rmse:30008.15365 [41] validation_0-rmse:14729.30351 validation_1-rmse:29971.12883 [42] validation_0-rmse:14619.09227 validation_1-rmse:30000.85843 [43] validation_0-rmse:14516.68055 validation_1-rmse:30017.30245 [44] validation_0-rmse:14434.56843 validation_1-rmse:29985.43943 [45] validation_0-rmse:14267.31069 validation_1-rmse:29987.22929 [46] validation_0-rmse:14186.62625 validation_1-rmse:30063.01536 [47] validation_0-rmse:14114.96128 validation_1-rmse:30071.45333 [48] validation_0-rmse:14045.32317 validation_1-rmse:30136.62240 [49] validation_0-rmse:13878.71972 validation_1-rmse:30132.72088 [50] validation_0-rmse:13756.14819 validation_1-rmse:30133.40163 [51] validation_0-rmse:13670.45072 validation_1-rmse:30137.18406 [52] validation_0-rmse:13589.24230 validation_1-rmse:30189.83340 [53] validation_0-rmse:13534.02417 validation_1-rmse:30159.16439 [54] validation_0-rmse:13402.63195 validation_1-rmse:30209.28167 [55] validation_0-rmse:13325.79715 validation_1-rmse:30179.88467 [56] validation_0-rmse:13252.86668 validation_1-rmse:30157.87934 [57] validation_0-rmse:13189.35689 validation_1-rmse:30120.98764 [58] validation_0-rmse:13122.00641 validation_1-rmse:30111.69700 [59] validation_0-rmse:13073.39773 validation_1-rmse:30121.59937 [60] validation_0-rmse:13016.42063 validation_1-rmse:30109.56940 [61] validation_0-rmse:12905.79256 validation_1-rmse:30123.86148 [62] validation_0-rmse:12817.74617 validation_1-rmse:30118.21006 [63] validation_0-rmse:12778.45647 validation_1-rmse:30104.93015 [64] validation_0-rmse:12693.12025 validation_1-rmse:30125.27322 [65] validation_0-rmse:12521.20022 validation_1-rmse:30167.86210 [66] validation_0-rmse:12439.29640 validation_1-rmse:30171.20592 [67] validation_0-rmse:12375.86631 validation_1-rmse:30161.93348 [68] validation_0-rmse:12309.27041 validation_1-rmse:30164.23156 [69] validation_0-rmse:12269.31452 validation_1-rmse:30149.12402 [70] validation_0-rmse:12217.43429 validation_1-rmse:30152.37377 [71] validation_0-rmse:12163.80516 validation_1-rmse:30166.35987 Mean Squared Error: 898268566.4025571 Mean Absolute Error: 19523.425660851884 R² Score: 0.8828904032707214 RMSE Score: 29971.12888101743
In [277]:
X_test = test.drop(columns=['Id'])
predictions = model.predict(X_test)
output = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
Your submission was successfully saved!
🔹 Experiment¶
In [292]:
y = train["SalePrice"]
X = pd.get_dummies(train.drop(columns=["SalePrice"]))
X_test = pd.get_dummies(test)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=SEED)
model = xg.XGBRegressor(
n_estimators=200,
learning_rate=0.1,
max_depth=6,
early_stopping_rounds=20,
alpha=0.1,
lambda_=0.1,
gamma=0.1,
random_state=SEED)
model.fit(X, y,
eval_set=[(X_train, y_train), (X_val, y_val)])
predictions = model.predict(X_test)
predictions_val = model.predict(X_val)
mse = mean_squared_error(y_val, predictions_val)
mae = mean_absolute_error(y_val, predictions_val)
r2 = r2_score(y_val, predictions_val)
rms = root_mean_squared_error(y_val, predictions_val)
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R² Score: {r2}")
print(f"RMSE Score: {rms}")
output = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
[0] validation_0-rmse:70587.64987 validation_1-rmse:80031.70279 [1] validation_0-rmse:64649.68176 validation_1-rmse:73184.89989 [1] validation_0-rmse:64649.68176 validation_1-rmse:73184.89989 [2] validation_0-rmse:59242.00549 validation_1-rmse:67185.38236 [3] validation_0-rmse:54359.50865 validation_1-rmse:61895.44966 [4] validation_0-rmse:49975.55922 validation_1-rmse:56932.04077 [5] validation_0-rmse:46056.33986 validation_1-rmse:52665.26736 [6] validation_0-rmse:42466.32732 validation_1-rmse:48643.13051 [7] validation_0-rmse:39249.99216 validation_1-rmse:45025.81572 [8] validation_0-rmse:36330.76481 validation_1-rmse:41553.31432 [9] validation_0-rmse:33654.57960 validation_1-rmse:38626.54093 [10] validation_0-rmse:31313.60302 validation_1-rmse:35821.70997 [11] validation_0-rmse:29110.11036 validation_1-rmse:33341.97377 [12] validation_0-rmse:27103.52663 validation_1-rmse:31081.01471 [13] validation_0-rmse:25287.70466 validation_1-rmse:29084.07657 [14] validation_0-rmse:23659.23322 validation_1-rmse:27250.29113 [15] validation_0-rmse:22147.78931 validation_1-rmse:25627.62607 [16] validation_0-rmse:20838.57970 validation_1-rmse:24102.99901 [17] validation_0-rmse:19650.22950 validation_1-rmse:22670.59125 [18] validation_0-rmse:18587.57553 validation_1-rmse:21453.46037 [19] validation_0-rmse:17613.47630 validation_1-rmse:20314.02078 [20] validation_0-rmse:16757.09225 validation_1-rmse:19266.49970 [21] validation_0-rmse:15981.54455 validation_1-rmse:18293.96320 [22] validation_0-rmse:15285.74510 validation_1-rmse:17462.45386 [23] validation_0-rmse:14674.46288 validation_1-rmse:16632.30008 [24] validation_0-rmse:14073.28338 validation_1-rmse:15877.35422 [25] validation_0-rmse:13541.73838 validation_1-rmse:15201.78559 [26] validation_0-rmse:13063.53557 validation_1-rmse:14571.93830 [27] validation_0-rmse:12618.95625 validation_1-rmse:13982.81317 [28] validation_0-rmse:12213.15044 validation_1-rmse:13416.22001 [29] validation_0-rmse:11831.96912 validation_1-rmse:12885.53252 [30] validation_0-rmse:11485.12523 validation_1-rmse:12391.00656 [31] validation_0-rmse:11175.60498 validation_1-rmse:11957.29005 [32] validation_0-rmse:10851.96172 validation_1-rmse:11575.84600 [33] validation_0-rmse:10566.09478 validation_1-rmse:11211.75370 [34] validation_0-rmse:10322.16180 validation_1-rmse:10813.40977 [35] validation_0-rmse:10124.33637 validation_1-rmse:10501.44477 [36] validation_0-rmse:9889.28919 validation_1-rmse:10142.20931 [37] validation_0-rmse:9677.60613 validation_1-rmse:9835.03811 [38] validation_0-rmse:9476.21101 validation_1-rmse:9602.87599 [39] validation_0-rmse:9300.46041 validation_1-rmse:9388.69295 [40] validation_0-rmse:9110.44486 validation_1-rmse:9204.20090 [41] validation_0-rmse:8952.29241 validation_1-rmse:9034.83823 [42] validation_0-rmse:8765.06154 validation_1-rmse:8835.48348 [43] validation_0-rmse:8625.50611 validation_1-rmse:8670.94196 [44] validation_0-rmse:8493.23515 validation_1-rmse:8500.11982 [45] validation_0-rmse:8350.93387 validation_1-rmse:8374.53401 [46] validation_0-rmse:8225.78918 validation_1-rmse:8167.25119 [47] validation_0-rmse:8099.35086 validation_1-rmse:8071.30511 [48] validation_0-rmse:7984.65752 validation_1-rmse:7922.81929 [49] validation_0-rmse:7886.92947 validation_1-rmse:7850.71196 [50] validation_0-rmse:7794.60942 validation_1-rmse:7755.77065 [51] validation_0-rmse:7708.24187 validation_1-rmse:7658.16723 [52] validation_0-rmse:7620.49625 validation_1-rmse:7551.95646 [53] validation_0-rmse:7569.74714 validation_1-rmse:7481.04963 [54] validation_0-rmse:7465.39791 validation_1-rmse:7379.62983 [55] validation_0-rmse:7371.19594 validation_1-rmse:7273.97990 [56] validation_0-rmse:7278.57362 validation_1-rmse:7198.43798 [57] validation_0-rmse:7173.35822 validation_1-rmse:7114.04825 [58] validation_0-rmse:7136.87679 validation_1-rmse:7069.84567 [59] validation_0-rmse:7058.10708 validation_1-rmse:6987.45067 [60] validation_0-rmse:6988.97546 validation_1-rmse:6912.30361 [61] validation_0-rmse:6892.58579 validation_1-rmse:6835.89551 [62] validation_0-rmse:6819.68506 validation_1-rmse:6757.16717 [63] validation_0-rmse:6748.45368 validation_1-rmse:6685.19423 [64] validation_0-rmse:6681.94260 validation_1-rmse:6642.21978 [65] validation_0-rmse:6601.94817 validation_1-rmse:6572.81010 [66] validation_0-rmse:6505.71789 validation_1-rmse:6478.77918 [67] validation_0-rmse:6449.04379 validation_1-rmse:6407.14816 [68] validation_0-rmse:6404.63080 validation_1-rmse:6377.95603 [69] validation_0-rmse:6357.67795 validation_1-rmse:6330.71900 [70] validation_0-rmse:6299.29617 validation_1-rmse:6280.65951 [71] validation_0-rmse:6235.32342 validation_1-rmse:6214.64317 [72] validation_0-rmse:6208.95556 validation_1-rmse:6180.39512 [73] validation_0-rmse:6157.02330 validation_1-rmse:6128.25658 [74] validation_0-rmse:6127.79588 validation_1-rmse:6105.01079 [75] validation_0-rmse:6051.12253 validation_1-rmse:6032.61492 [76] validation_0-rmse:6010.87126 validation_1-rmse:5993.43207 [77] validation_0-rmse:5993.50053 validation_1-rmse:5976.64015 [78] validation_0-rmse:5939.30511 validation_1-rmse:5929.25695 [79] validation_0-rmse:5912.48947 validation_1-rmse:5902.42666 [80] validation_0-rmse:5815.24231 validation_1-rmse:5832.58434 [81] validation_0-rmse:5797.36301 validation_1-rmse:5814.00680 [82] validation_0-rmse:5749.41861 validation_1-rmse:5793.56690 [83] validation_0-rmse:5677.89215 validation_1-rmse:5743.61818 [84] validation_0-rmse:5633.00051 validation_1-rmse:5699.19558 [85] validation_0-rmse:5594.40849 validation_1-rmse:5653.17000 [86] validation_0-rmse:5580.69699 validation_1-rmse:5640.91717 [87] validation_0-rmse:5566.39635 validation_1-rmse:5613.70050 [88] validation_0-rmse:5520.50828 validation_1-rmse:5560.43383 [89] validation_0-rmse:5489.74263 validation_1-rmse:5529.57610 [90] validation_0-rmse:5449.62668 validation_1-rmse:5510.24936 [91] validation_0-rmse:5418.45678 validation_1-rmse:5473.16424 [92] validation_0-rmse:5388.37255 validation_1-rmse:5440.24021 [93] validation_0-rmse:5362.28881 validation_1-rmse:5425.56233 [94] validation_0-rmse:5322.91534 validation_1-rmse:5389.84024 [95] validation_0-rmse:5294.96160 validation_1-rmse:5368.44202 [96] validation_0-rmse:5282.40069 validation_1-rmse:5354.68423 [97] validation_0-rmse:5242.28312 validation_1-rmse:5322.16064 [98] validation_0-rmse:5202.26545 validation_1-rmse:5287.06498 [99] validation_0-rmse:5174.85089 validation_1-rmse:5260.35961 [100] validation_0-rmse:5144.72702 validation_1-rmse:5232.19433 [101] validation_0-rmse:5101.61189 validation_1-rmse:5185.68563 [102] validation_0-rmse:5060.41015 validation_1-rmse:5154.68045 [103] validation_0-rmse:5025.89247 validation_1-rmse:5117.58963 [104] validation_0-rmse:5009.78968 validation_1-rmse:5105.60818 [105] validation_0-rmse:4988.40165 validation_1-rmse:5087.51981 [106] validation_0-rmse:4974.69629 validation_1-rmse:5068.66525 [107] validation_0-rmse:4952.76548 validation_1-rmse:5052.18994 [108] validation_0-rmse:4887.03583 validation_1-rmse:4976.90300 [109] validation_0-rmse:4874.04496 validation_1-rmse:4946.23736 [110] validation_0-rmse:4849.61283 validation_1-rmse:4917.60782 [111] validation_0-rmse:4835.78535 validation_1-rmse:4899.17147 [112] validation_0-rmse:4789.57721 validation_1-rmse:4854.19617 [113] validation_0-rmse:4766.66454 validation_1-rmse:4828.94202 [114] validation_0-rmse:4727.60927 validation_1-rmse:4803.42931 [115] validation_0-rmse:4676.17932 validation_1-rmse:4745.15357 [116] validation_0-rmse:4650.10997 validation_1-rmse:4718.29714 [117] validation_0-rmse:4619.37273 validation_1-rmse:4690.31852 [118] validation_0-rmse:4573.83564 validation_1-rmse:4649.45849 [119] validation_0-rmse:4547.31865 validation_1-rmse:4619.54640 [120] validation_0-rmse:4491.15697 validation_1-rmse:4558.28503 [121] validation_0-rmse:4441.91897 validation_1-rmse:4508.46869 [122] validation_0-rmse:4408.91937 validation_1-rmse:4479.93339 [123] validation_0-rmse:4361.35768 validation_1-rmse:4429.59941 [124] validation_0-rmse:4314.05256 validation_1-rmse:4381.95376 [125] validation_0-rmse:4258.36031 validation_1-rmse:4327.34349 [126] validation_0-rmse:4244.66227 validation_1-rmse:4302.72297 [127] validation_0-rmse:4225.99476 validation_1-rmse:4285.42942 [128] validation_0-rmse:4217.47108 validation_1-rmse:4277.20420 [129] validation_0-rmse:4189.42821 validation_1-rmse:4246.81239 [130] validation_0-rmse:4164.32663 validation_1-rmse:4216.79614 [131] validation_0-rmse:4144.25398 validation_1-rmse:4190.97523 [132] validation_0-rmse:4112.07313 validation_1-rmse:4148.61536 [133] validation_0-rmse:4102.16972 validation_1-rmse:4128.64277 [134] validation_0-rmse:4077.25396 validation_1-rmse:4104.30039 [135] validation_0-rmse:4056.84414 validation_1-rmse:4085.67558 [136] validation_0-rmse:4044.31283 validation_1-rmse:4074.29904 [137] validation_0-rmse:4014.01197 validation_1-rmse:4049.64520 [138] validation_0-rmse:3992.64410 validation_1-rmse:4028.27538 [139] validation_0-rmse:3964.81126 validation_1-rmse:3990.36035 [140] validation_0-rmse:3913.75930 validation_1-rmse:3953.12549 [141] validation_0-rmse:3890.90020 validation_1-rmse:3941.50577 [142] validation_0-rmse:3846.02751 validation_1-rmse:3885.46475 [143] validation_0-rmse:3842.45550 validation_1-rmse:3879.50182 [144] validation_0-rmse:3823.98294 validation_1-rmse:3871.50742 [145] validation_0-rmse:3767.71679 validation_1-rmse:3808.29943 [146] validation_0-rmse:3737.63180 validation_1-rmse:3787.87498 [147] validation_0-rmse:3719.42349 validation_1-rmse:3767.07622 [148] validation_0-rmse:3708.98350 validation_1-rmse:3751.79678 [149] validation_0-rmse:3695.12640 validation_1-rmse:3746.80037 [150] validation_0-rmse:3679.49772 validation_1-rmse:3730.16574 [151] validation_0-rmse:3630.48808 validation_1-rmse:3685.79404 [152] validation_0-rmse:3609.38037 validation_1-rmse:3665.73230 [153] validation_0-rmse:3576.39377 validation_1-rmse:3626.26843 [154] validation_0-rmse:3552.64785 validation_1-rmse:3608.97743 [155] validation_0-rmse:3531.24669 validation_1-rmse:3573.57183 [156] validation_0-rmse:3503.82710 validation_1-rmse:3554.14229 [157] validation_0-rmse:3443.69260 validation_1-rmse:3517.59897 [158] validation_0-rmse:3436.36577 validation_1-rmse:3511.07014 [159] validation_0-rmse:3415.63803 validation_1-rmse:3496.55568 [160] validation_0-rmse:3381.27854 validation_1-rmse:3466.26842 [161] validation_0-rmse:3346.78595 validation_1-rmse:3419.65418 [162] validation_0-rmse:3319.95810 validation_1-rmse:3386.55903 [163] validation_0-rmse:3294.38413 validation_1-rmse:3361.04764 [164] validation_0-rmse:3267.84582 validation_1-rmse:3338.48655 [165] validation_0-rmse:3245.09032 validation_1-rmse:3324.24743 [166] validation_0-rmse:3223.23321 validation_1-rmse:3312.40744 [167] validation_0-rmse:3218.06584 validation_1-rmse:3309.82755 [168] validation_0-rmse:3203.86517 validation_1-rmse:3300.54403 [169] validation_0-rmse:3165.38852 validation_1-rmse:3275.33392 [170] validation_0-rmse:3145.66778 validation_1-rmse:3264.60394 [171] validation_0-rmse:3124.46120 validation_1-rmse:3242.11976 [172] validation_0-rmse:3115.78265 validation_1-rmse:3236.92548 [173] validation_0-rmse:3097.40372 validation_1-rmse:3226.08651 [174] validation_0-rmse:3074.52883 validation_1-rmse:3205.52744 [175] validation_0-rmse:3066.97928 validation_1-rmse:3202.01259 [176] validation_0-rmse:3029.02029 validation_1-rmse:3162.68539 [177] validation_0-rmse:3018.13234 validation_1-rmse:3159.12957 [178] validation_0-rmse:2976.18369 validation_1-rmse:3126.48747 [179] validation_0-rmse:2947.34827 validation_1-rmse:3088.67305 [180] validation_0-rmse:2913.13894 validation_1-rmse:3052.62399 [181] validation_0-rmse:2883.91406 validation_1-rmse:3023.61200 [182] validation_0-rmse:2854.28859 validation_1-rmse:2988.84954 [183] validation_0-rmse:2835.13490 validation_1-rmse:2972.10465 [184] validation_0-rmse:2824.07141 validation_1-rmse:2957.72309 [185] validation_0-rmse:2809.90190 validation_1-rmse:2946.91391 [186] validation_0-rmse:2784.31360 validation_1-rmse:2923.10220 [187] validation_0-rmse:2753.50491 validation_1-rmse:2884.24948 [188] validation_0-rmse:2726.44551 validation_1-rmse:2853.00564 [189] validation_0-rmse:2719.54164 validation_1-rmse:2848.69755 [190] validation_0-rmse:2708.18044 validation_1-rmse:2831.37940 [191] validation_0-rmse:2681.38562 validation_1-rmse:2789.62957 [192] validation_0-rmse:2642.87954 validation_1-rmse:2756.96800 [193] validation_0-rmse:2602.38742 validation_1-rmse:2722.92170 [194] validation_0-rmse:2591.99517 validation_1-rmse:2711.62501 [195] validation_0-rmse:2567.58094 validation_1-rmse:2692.85217 [196] validation_0-rmse:2555.80563 validation_1-rmse:2674.29760 [197] validation_0-rmse:2529.68248 validation_1-rmse:2645.25917 [198] validation_0-rmse:2507.05234 validation_1-rmse:2624.34462 [199] validation_0-rmse:2467.73762 validation_1-rmse:2593.10381 Mean Squared Error: 6724187.362976126 Mean Absolute Error: 1856.9438811001712 R² Score: 0.9991233348846436 RMSE Score: 2593.1038087543134 Your submission was successfully saved!